This project used the provided newzy dataset, which contains text from news articles dating from 2014 to 2020.
First, the libraries are imported that are needed to convert the texts from their raw, as-received format (F0) into the Machine Learning Corpus Format (F1) and the Standard Text Analytic Data Model (F2).
import pandas as pd
import nltk
import numpy as np
import plotly.express as px
Next, the data is imported from a .csv file. This file uses the | character as a delimiter, so it is passed as the sep argument to pandas.read_csv.
df = pd.read_csv('newzy/newzy.csv', sep='|')
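As a quick check of the delimiter handling, the sketch below parses a small pipe-delimited sample held in memory. The sample rows and the column names `doc_id` and `doc_source` are hypothetical. Only `doc_content` appears in this report, and the real file may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical sample; the real columns in newzy.csv may differ.
sample = io.StringIO(
    "doc_id|doc_source|doc_content\n"
    "1|example_a|Full article text, with commas, stays in one field.\n"
    "2|example_b|Short teaser\n"
)

# With sep="|", commas inside the text are not treated as field separators.
sample_df = pd.read_csv(sample, sep="|")
print(sample_df.shape)
print(sample_df.columns.tolist())
```

A pipe delimiter suits free text like this, because commas inside an article body do not split it into extra fields.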
Exploratory data analysis began with content length, because the data description indicated that only some of the sources contain full documents. The distribution of log10-scaled content length (number of characters) is shown below:
df['content_length'] = np.log10(df.doc_content.str.len())
px.histogram(df.content_length)
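Missing or empty documents need care in this transform: a missing `doc_content` gives a NaN length, and the log10 of a zero length is negative infinity. The following is a minimal sketch of one way to guard against both, assuming such rows may exist in the data (this report does not confirm that they do). The `toy` frame is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; only the doc_content column name comes from the report.
toy = pd.DataFrame({"doc_content": ["a" * 10, "b" * 1000, "", None]})

lengths = toy["doc_content"].str.len()   # NaN where content is missing
valid = lengths[lengths > 0]             # keep only rows with at least one character

# Assignment aligns on the index, so excluded rows become NaN instead of -inf.
toy["content_length"] = np.log10(valid)
print(toy["content_length"].tolist())
```

The resulting column can be plotted the same way as above, for example with `px.histogram(toy, x="content_length")`, which leaves the NaN rows out of the bins.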